Abstract

We describe and analyze a new algorithm for agnostically learning kernel-based halfspaces
with respect to the zero-one loss function. Unlike most previous formulations, which rely on
surrogate convex loss functions (e.g. hinge-loss in SVM and log-loss in logistic regression),
we provide finite time/sample guarantees with respect to the more natural zero-one loss
function. The proposed algorithm can learn kernel-based halfspaces in worst-case time
$\mathrm{poly}(\exp(L \log(L/\epsilon)))$, for any distribution, where $L$ is a Lipschitz constant (which can be
thought of as the reciprocal of the margin), and the learned classifier is worse than the
optimal halfspace by at most $\epsilon$. We also prove a hardness result, showing that under a
certain cryptographic assumption, no algorithm can learn kernel-based halfspaces in time
polynomial in $L$.

1 Introduction

A highly important hypothesis class in machine learning theory and applications is that of halfspaces
in a Reproducing Kernel Hilbert Space (RKHS). Choosing a halfspace based on empirical data is
often performed using Support Vector Machines (SVMs) [26]. SVMs replace the more natural 0-1
loss function with a convex surrogate, the hinge-loss. By doing so, we can rely on convex
optimization tools. However, there are no guarantees on how well the hinge-loss approximates the
0-1 loss function. There do exist some recent results on the asymptotic relationship between surrogate
convex loss functions and the 0-1 loss function [11,13], but these do not come with finite-sample or
finite-time guarantees. In this paper, we tackle the task of learning kernel-based halfspaces with
respect to the non-convex 0-1 loss function. Our goal is to derive learning algorithms and to analyze
them in the finite-sample finite-time setting.

Following the standard statistical learning framework, we assume that there is an unknown
distribution, $\mathcal{D}$, over the set of labeled examples, $\mathcal{X} \times \{0,1\}$, and our primary goal is to find a
classifier, $h : \mathcal{X} \to \{0,1\}$, with low generalization error,

$$ \mathrm{err}_{\mathcal{D}}(h) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\,|h(x) - y|\,\big] . \tag{1} $$

The learning algorithm is allowed to sample a training set of labeled examples, $(x_1, y_1), \ldots, (x_m, y_m)$,
where each example is sampled i.i.d. from $\mathcal{D}$, and it returns a classifier. Following the agnostic PAC
learning framework [16], we say that an algorithm $(\epsilon, \delta)$-learns a concept class $H$ of classifiers using
$m$ examples, if with probability of at least $1 - \delta$ over a random choice of $m$ examples the algorithm
returns a classifier $\hat{h}$ that satisfies

$$ \mathrm{err}_{\mathcal{D}}(\hat{h}) \le \inf_{h \in H} \mathrm{err}_{\mathcal{D}}(h) + \epsilon . \tag{2} $$

We note that h does not necessarily belong to H. Namely, we are concerned with improper learning, 
which is as useful as proper learning for the purpose of deriving good classifiers. A common learning 
paradigm is the Empirical Risk Minimization (ERM) rule, which returns a classifier that minimizes 
the average error over the training set, 

$$ \hat{h} \in \operatorname*{argmin}_{h \in H} \frac{1}{m} \sum_{i=1}^{m} |h(x_i) - y_i| . $$

Figure 1: Illustrations of transfer functions for L = 10 (left) and L = 3 (right): the 0-1 transfer function
(dashed blue line); the sigmoid transfer function (dotted black line); the erf transfer function (green line);
the piece-wise linear transfer function (dashed red line).

The class of (origin centered) halfspaces is defined as follows. Let $\mathcal{X}$ be a compact subset of a
RKHS, which w.l.o.g. will be taken to be the unit ball around the origin. Let $\phi_{0-1} : \mathbb{R} \to \mathbb{R}$ be the
function $\phi_{0-1}(a) = \mathbb{1}(a > 0) = \frac{1}{2}(\mathrm{sgn}(a) + 1)$. The class of halfspaces is the set of classifiers

$$ H_{\phi_{0-1}} := \{ x \mapsto \phi_{0-1}(\langle w, x \rangle) : w \in \mathcal{X} \} . $$

Although we represent the halfspace using $w \in \mathcal{X}$, which is a vector in the RKHS whose
dimensionality can be infinite, in practice we only need a function that implements inner products in the
RKHS (a.k.a. a kernel function), and one can define $w$ as the coefficients of a linear combination of
examples in our training set. To simplify the notation throughout the paper, we represent $w$ simply
as a vector in the RKHS.

It is well known that if the dimensionality of $\mathcal{X}$ is $n$, then the VC dimension of $H_{\phi_{0-1}}$ equals $n$.
This implies that the number of training examples required to obtain a guarantee of the form given
in Equation (2) for the class of halfspaces scales at least linearly with the dimension $n$ [26]. Since
kernel-based learning algorithms allow $\mathcal{X}$ to be an infinite dimensional inner product space, we must
use a different class in order to obtain a guarantee of the form given in Equation (2).

One way to define a slightly different concept class is to approximate the non-continuous function,
$\phi_{0-1}$, with a Lipschitz continuous function, $\phi : \mathbb{R} \to [0, 1]$, which is often called a transfer function.
For example, we can use a sigmoidal transfer function

$$ \phi_{\mathrm{sig}}(a) := \frac{1}{1 + \exp(-4 L a)} , \tag{3} $$

which is an $L$-Lipschitz function. Other $L$-Lipschitz transfer functions are the erf function and the
piece-wise linear function:

$$ \phi_{\mathrm{erf}}(a) := \tfrac{1}{2}\big(1 + \mathrm{erf}(\sqrt{\pi}\, L\, a)\big) , \qquad \phi_{\mathrm{pw}}(a) := \max\{\min\{\tfrac{1}{2} + L a,\ 1\},\ 0\} . \tag{4} $$

An illustration of these transfer functions is given in Figure 1. Analogously to the definition of
$H_{\phi_{0-1}}$, for a general transfer function $\phi$ we define $H_\phi$ to be the set of predictors $x \mapsto \phi(\langle w, x \rangle)$.
Since now the range of $\phi$ is not $\{0, 1\}$ but rather the entire interval $[0, 1]$,¹ we interpret $\phi(\langle w, x \rangle)$ as
the probability to output the label 1. The definition of $\mathrm{err}_{\mathcal{D}}(h)$ remains as in Equation (1).
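To make the transfer functions concrete, the following Python sketch evaluates them and estimates their steepest slope on a grid. The function names and the helper `max_slope` are ours and purely illustrative; the sigmoid is written with the $4L$ scaling so that its slope at the origin is $L$, matching the Lipschitz constant stated above.

```python
import math

def phi_01(a):
    """0-1 transfer function: 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def phi_sig(a, L):
    """Sigmoid transfer function 1 / (1 + exp(-4 L a)); its slope at a = 0 is L."""
    return 1.0 / (1.0 + math.exp(-4.0 * L * a))

def phi_erf(a, L):
    """Erf transfer function (1 + erf(sqrt(pi) L a)) / 2."""
    return 0.5 * (1.0 + math.erf(math.sqrt(math.pi) * L * a))

def phi_pw(a, L):
    """Piece-wise linear transfer function max(min(1/2 + L a, 1), 0)."""
    return max(min(0.5 + L * a, 1.0), 0.0)

def max_slope(phi, L, lo=-1.0, hi=1.0, steps=2000):
    """Largest finite-difference slope of phi(., L) on a uniform grid over [lo, hi]."""
    h = (hi - lo) / steps
    return max(
        abs(phi(lo + (i + 1) * h, L) - phi(lo + i * h, L)) / h
        for i in range(steps)
    )
```

For any of the three smooth or piece-wise linear functions, `max_slope(phi, L)` should not exceed `L` (up to rounding), which is the property the generalization bound relies on.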

The advantage of using a Lipschitz transfer function can be seen via Rademacher generalization
bounds [3]. In fact, a simple corollary of the contraction lemma implies the following:

Theorem 1 Let $\epsilon, \delta \in (0,1)$ and let $\phi$ be an $L$-Lipschitz transfer function. Let $m$ be an integer
satisfying

$$ m \ge \left( \frac{2L + 3\sqrt{2 \ln(8/\delta)}}{\epsilon} \right)^{2} . $$

Then, for any distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$, the ERM algorithm $(\epsilon, \delta)$-learns the concept class $H_\phi$
using $m$ examples.

¹ Note that in this case $\mathrm{err}_{\mathcal{D}}(h)$ can be interpreted as $\mathbb{P}_{(x,y) \sim \mathcal{D},\, b \sim \phi(\langle w, x \rangle)}[y \ne b]$.



The above theorem tells us that the sample complexity of learning $H_\phi$ is of order $L^2/\epsilon^2$. Crucially, the
sample complexity does not depend on the dimensionality of $\mathcal{X}$, but only on the Lipschitz constant
of the transfer function. This allows us to learn with kernels, when the dimensionality of $\mathcal{X}$ can even
be infinite. A related analysis compares the error rate of a halfspace $w$ to the number of margin
mistakes $w$ makes on the training set; see Section 4.1 for a comparison.

From the computational complexity point of view, the result given in Theorem 1 is problematic,
since the ERM algorithm should solve the non-convex optimization problem

$$ \operatorname*{argmin}_{w : \|w\| \le 1} \frac{1}{m} \sum_{i=1}^{m} |\phi(\langle w, x_i \rangle) - y_i| . \tag{5} $$

Solving this problem in polynomial time is hard under reasonable assumptions (see Section 3, in which
we present a formal hardness result). Adapting a technique due to [6], we show in Appendix A that
it is possible to find an $\epsilon$-accurate solution to Equation (5) (where the transfer function is $\phi_{\mathrm{pw}}$) in
time $\mathrm{poly}\big(\exp\big(\frac{L^2}{\epsilon^2} \log(\frac{L}{\epsilon})\big)\big)$. The main contribution of this paper is the derivation and analysis of a
simpler learning algorithm that $(\epsilon, \delta)$-learns the class $H_{\mathrm{sig}}$ using time and sample complexity of
at most $\mathrm{poly}\big(\exp\big(L \log(\frac{L}{\epsilon})\big)\big)$. That is, the runtime of our algorithm is exponentially smaller than
the runtime required to solve the ERM problem using the technique described in [6]. Moreover,
the algorithm of [6] performs an exhaustive search over all subsets of $(L/\epsilon)^2$ of the $m$ examples in
the training set, and therefore its runtime is always of order $m^{L^2/\epsilon^2}$. In contrast, our algorithm's
runtime depends on a parameter $B$, which is bounded by $\exp(L)$ only under a worst-case assumption.
Depending on the underlying distribution, $B$ can be much smaller than the worst-case bound. In
practice, we will cross-validate for $B$, and therefore the worst-case bound will often be pessimistic.

The rest of the paper is organized as follows. In Section 2 we describe our main results. Next, in
Section 3 we provide a hardness result, showing that it is not likely that there exists an algorithm
that learns $H_{\mathrm{sig}}$ or $H_{\mathrm{pw}}$ in time polynomial in $L$. We outline additional related work in Section 4. In
particular, the relation between our approach and margin-based analysis is described in Section 4.1
and the relation to approaches utilizing a distributional assumption is discussed in Section 4.2. We
wrap up with a discussion in Section 5.

2 Main Results 

In this section we present our main result. Recall that we would like to derive an algorithm which
learns the class $H_{\mathrm{sig}}$. However, the ERM optimization problem associated with $H_{\mathrm{sig}}$ is non-convex.
The main idea behind our construction is to learn a larger hypothesis class, denoted $H_B$, which
approximately contains $H_{\mathrm{sig}}$, and for which the ERM optimization problem becomes convex. The
price we need to pay is that from the statistical point of view, it is more difficult to learn the class
$H_B$ than the class $H_{\mathrm{sig}}$, therefore the sample complexity increases.

The class $H_B$ we use is a class of linear predictors in some other RKHS. The kernel function
that implements the inner product in the newly constructed RKHS is

$$ K(x, x') := \frac{1}{1 - \nu \langle x, x' \rangle} , \tag{6} $$

where $\nu \in (0, 1)$ is a parameter and $\langle x, x' \rangle$ is the inner product in the original RKHS. As mentioned
previously, $\langle x, x' \rangle$ is usually implemented by some kernel function $K'(z, z')$, where $z$ and $z'$ are the
pre-images of $x$ and $x'$ with respect to the feature mapping induced by $K'$. Therefore, the kernel in
Equation (6) is simply a composition with $K'$, i.e. $K(z, z') = 1/(1 - \nu K'(z, z'))$.
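The composition in Equation (6) is straightforward to implement on top of any base kernel whose values lie in $[-1, 1]$. The sketch below is illustrative only: the Gaussian base kernel, its `width` parameter, and the function names are our assumptions, chosen because a Gaussian kernel satisfies $K'(z, z) = 1$ and therefore maps inputs to the unit sphere of its RKHS.

```python
import math

def gaussian_kernel(z, z2, width=1.0):
    """Hypothetical base kernel K'; K'(z, z) = 1, so the features have unit norm."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, z2))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def composed_kernel(z, z2, base_kernel=gaussian_kernel, nu=0.5):
    """K(z, z') = 1 / (1 - nu * K'(z, z')), the kernel of Equation (6)."""
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie in (0, 1)")
    inner = base_kernel(z, z2)
    if abs(inner) > 1.0:
        # Outside the unit ball the denominator could approach zero.
        raise ValueError("base kernel values must lie in [-1, 1]")
    return 1.0 / (1.0 - nu * inner)
```

With `nu=0.5` the composed kernel is bounded by 2, which is the bound $K(x, x) \le 2$ used later in the sample complexity analysis.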

To simplify the presentation we will set $\nu = 1/2$, although in practice other choices might
be more effective. It is easy to verify that $K$ is a valid positive definite kernel function (see for
example [21, 10]). Therefore, there exists some mapping $\psi : \mathcal{X} \to \mathbb{V}$, where $\mathbb{V}$ is an RKHS with
$\langle \psi(x), \psi(x') \rangle = K(x, x')$. The class $H_B$ is defined to be:

$$ H_B := \{ x \mapsto \langle v, \psi(x) \rangle : v \in \mathbb{V},\ \|v\|^2 \le B \} . \tag{7} $$

The main result we prove in this section is the following: 

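Theorem 2 Let $\epsilon, \delta \in (0, 1)$ and let $L \ge 3$. Let $B = 2L^4 + \exp\big(7L \log(\frac{2L}{\epsilon}) + 3\big)$ and let $m$ be a
sample size that satisfies $m \ge \frac{8B}{\epsilon^2} \big(2 + 9\sqrt{\ln(8/\delta)}\big)^2$. Then, for any distribution $\mathcal{D}$, with probability
of at least $1 - \delta$, any ERM predictor $\hat{h} \in H_B$ with respect to $H_B$ satisfies

$$ \mathrm{err}_{\mathcal{D}}(\hat{h}) \le \min_{h_{\mathrm{sig}} \in H_{\mathrm{sig}}} \mathrm{err}_{\mathcal{D}}(h_{\mathrm{sig}}) + \epsilon . $$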



We note that the bound on B is far from being the tightest possible in terms of constants and 
second-order terms. Also, the assumption of L > 3 is rather arbitrary, and is meant to simplify the 
presentation of the bound. 

To prove this theorem, we start with analyzing the time and sample complexity of learning $H_B$.
The sample complexity analysis follows directly from a Rademacher generalization bound [3]. In
particular, the following theorem tells us that the sample complexity of learning $H_B$ with the ERM
rule is of order $B/\epsilon^2$ examples.



Theorem 3 Let $\epsilon, \delta \in (0, 1)$, let $B \ge 1$, and let $m$ be a sample size that satisfies

$$ m \ge \frac{2B}{\epsilon^2} \left( 2 + 9\sqrt{\ln(8/\delta)} \right)^{2} . $$

Then, for any distribution $\mathcal{D}$, the ERM algorithm $(\epsilon, \delta)$-learns $H_B$.

Proof Since $K(x, x) \le 2$, the Rademacher complexity of $H_B$ is bounded by $\sqrt{2B/m}$ (see also [14]).
Additionally, using the Cauchy-Schwarz inequality we have that the loss is bounded, $|\langle v, \psi(x) \rangle - y| \le
\sqrt{2B} + 1$. The result now follows directly from [3, 14]. ■
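As a quick aid for reading the bound, the sketch below computes the smallest integer sample size allowed by the condition of Theorem 3. The function name is ours, and the constants are those of the theorem statement as we read it; treat it as a calculator for the stated inequality, not as a tuned recommendation.

```python
import math

def theorem3_sample_size(B, eps, delta):
    """Smallest integer m with m >= (2B / eps^2) * (2 + 9 * sqrt(ln(8 / delta)))^2."""
    if B < 1 or not (0.0 < eps < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("need B >= 1 and eps, delta in (0, 1)")
    confidence_term = 2.0 + 9.0 * math.sqrt(math.log(8.0 / delta))
    return math.ceil(2.0 * B / eps ** 2 * confidence_term ** 2)
```

The value grows linearly in $B$ and quadratically in $1/\epsilon$, while the dependence on $\delta$ is only logarithmic.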

Next, we show that the ERM problem with respect to $H_B$ can be solved in time $\mathrm{poly}(m)$. The
ERM problem associated with $H_B$ is

$$ \min_{v : \|v\|^2 \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \psi(x_i) \rangle - y_i| . $$

Since the objective function is defined only via inner products with $\psi(x_i)$, and the constraint on
$v$ is defined by the $\ell_2$-norm, it follows by the Representer theorem [23] that there is an optimal
solution $v^\star$ that can be written as $v^\star = \sum_{i=1}^{m} \alpha_i \psi(x_i)$. Therefore, instead of optimizing over $v$, we
can optimize over the set of weights $\alpha_1, \ldots, \alpha_m$ by solving the equivalent optimization problem

$$ \min_{\alpha \in \mathbb{R}^m} \frac{1}{m} \sum_{i=1}^{m} \Big| \sum_{j=1}^{m} \alpha_j K(x_j, x_i) - y_i \Big| \quad \text{s.t.} \quad \sum_{i,j=1}^{m} \alpha_i \alpha_j K(x_i, x_j) \le B . $$

This is a convex optimization problem in $\mathbb{R}^m$ and therefore can be solved in time $\mathrm{poly}(m)$ using
standard optimization tools.² We therefore obtain:

Corollary 1 Let $\epsilon, \delta \in (0,1)$ and let $B \ge 1$. Then, for any distribution $\mathcal{D}$, it is possible to $(\epsilon, \delta)$-learn
$H_B$ in sample and time complexity of $\mathrm{poly}\big(\frac{B}{\epsilon} \log(1/\delta)\big)$.
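The convex program over $\alpha_1, \ldots, \alpha_m$ can be attacked with any standard solver. As one possible illustration (not the procedure analyzed in the paper), the sketch below runs a projected subgradient method directly in the RKHS: a subgradient of the objective is $\frac{1}{m} \sum_i \mathrm{sign}(r_i) \psi(x_i)$ for residuals $r_i$, and projecting onto $\{v : \|v\|^2 \le B\}$ is a rescaling. The function names, step-size rule, and iteration count are our assumptions; inputs are assumed to lie in the unit ball.

```python
import math

def composed_kernel(x, z, nu=0.5):
    """K(x, z) = 1 / (1 - nu <x, z>) for points x, z in the unit ball."""
    return 1.0 / (1.0 - nu * sum(a * b for a, b in zip(x, z)))

def erm_absolute_loss(X, y, B, steps=500):
    """Projected subgradient sketch for
         min_alpha (1/m) sum_i |sum_j alpha_j K(x_j, x_i) - y_i|
         s.t.      alpha^T K alpha <= B.
    Returns the best alpha visited and its objective value."""
    m = len(X)
    K = [[composed_kernel(xi, xj) for xj in X] for xi in X]

    def residuals(alpha):
        return [sum(alpha[j] * K[i][j] for j in range(m)) - y[i] for i in range(m)]

    alpha = [0.0] * m
    best_alpha = alpha[:]
    best_obj = sum(abs(r) for r in residuals(alpha)) / m
    for t in range(1, steps + 1):
        res = residuals(alpha)
        # Step size: ball radius sqrt(B) over the subgradient bound sqrt(2), times 1/sqrt(t).
        eta = math.sqrt(B / (2.0 * t))
        # In alpha-coordinates the subgradient is the vector sign(res) / m.
        alpha = [a - eta * math.copysign(1.0, r) / m if r != 0 else a
                 for a, r in zip(alpha, res)]
        # Projection onto the ball ||v||^2 <= B rescales v, hence rescales alpha.
        sq_norm = sum(alpha[i] * K[i][j] * alpha[j] for i in range(m) for j in range(m))
        if sq_norm > B:
            scale = math.sqrt(B / sq_norm)
            alpha = [a * scale for a in alpha]
        obj = sum(abs(r) for r in residuals(alpha)) / m
        if obj < best_obj:
            best_alpha, best_obj = alpha[:], obj
    return best_alpha, best_obj
```

A call such as `erm_absolute_loss([(0.6, 0.0), (-0.7, -0.1)], [1, 0], B=10.0)` returns coefficients defining the predictor $x \mapsto \sum_j \alpha_j K(x_j, x)$. Each iteration costs $O(m^2)$ kernel-matrix operations in this naive form.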

It is left to understand why the class $H_B$ approximately contains the class $H_{\mathrm{sig}}$. Recall that for
any transfer function, $\phi$, we define the class $H_\phi$ to be all the predictors of the form $x \mapsto \phi(\langle w, x \rangle)$.
The first step is to show that $H_B$ contains the union of $H_\phi$ over all polynomial transfer functions
that satisfy a certain boundedness condition on their coefficients.

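Lemma 1 Let $P_B$ be the following set of polynomials (possibly with infinite degree)

$$ P_B := \Big\{ p(a) = \sum_{j=0}^{\infty} \beta_j a^j \ :\ \sum_{j=0}^{\infty} \beta_j^2\, 2^j \le B \Big\} . \tag{8} $$

Then,

$$ \bigcup_{p \in P_B} H_p \subset H_B . $$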

Proof To simplify the proof, we first assume that $\mathcal{X}$ is simply the unit ball in $\mathbb{R}^n$, for an arbitrarily
large but finite $n$. Consider the mapping $\psi : \mathcal{X} \to \mathbb{R}^{\mathbb{N}}$ defined as follows: for any $x \in \mathcal{X}$, we let $\psi(x)$
be an infinite vector, indexed by $k_1, \ldots, k_j$ for all $(k_1, \ldots, k_j) \in \{1, \ldots, n\}^j$ and $j = 0, \ldots, \infty$, where
the entry at index $k_1, \ldots, k_j$ equals $2^{-j/2} x_{k_1} \cdot x_{k_2} \cdots x_{k_j}$. The inner-product between $\psi(x)$ and $\psi(x')$
for any $x, x' \in \mathcal{X}$ can be calculated as follows,

$$ \langle \psi(x), \psi(x') \rangle = \sum_{j=0}^{\infty} \sum_{(k_1, \ldots, k_j) \in \{1, \ldots, n\}^j} 2^{-j} x_{k_1} x'_{k_1} \cdots x_{k_j} x'_{k_j} = \sum_{j=0}^{\infty} 2^{-j} (\langle x, x' \rangle)^j = \frac{1}{1 - \frac{1}{2} \langle x, x' \rangle} . $$

This is exactly the kernel function defined in Equation (6) (recall that we set $\nu = 1/2$) and therefore
$\psi$ maps to the RKHS defined by $K$. Consider any polynomial $p(a) = \sum_{j=0}^{\infty} \beta_j a^j$ in $P_B$, and any
$w \in \mathcal{X}$. Let $v_w$ be an element in $\mathbb{R}^{\mathbb{N}}$ explicitly defined as being equal to $\beta_j 2^{j/2} w_{k_1} \cdots w_{k_j}$ at index
$k_1, \ldots, k_j$ (for all $(k_1, \ldots, k_j) \in \{1, \ldots, n\}^j$, $j = 0, \ldots, \infty$). By definition of $\psi$ and $v_w$, we have that

$$ \langle v_w, \psi(x) \rangle = \sum_{j=0}^{\infty} \sum_{k_1, \ldots, k_j} 2^{-j/2} \beta_j 2^{j/2} w_{k_1} \cdots w_{k_j} x_{k_1} \cdots x_{k_j} = \sum_{j=0}^{\infty} \beta_j (\langle w, x \rangle)^j = p(\langle w, x \rangle) . $$

In addition,

$$ \|v_w\|^2 = \sum_{j=0}^{\infty} \sum_{k_1, \ldots, k_j} \beta_j^2 2^j w_{k_1}^2 \cdots w_{k_j}^2 = \sum_{j=0}^{\infty} \beta_j^2 2^j \sum_{k_1} w_{k_1}^2 \sum_{k_2} w_{k_2}^2 \cdots \sum_{k_j} w_{k_j}^2 = \sum_{j=0}^{\infty} \beta_j^2 2^j \big( \|w\|^2 \big)^j \le B . $$

Thus, the predictor $x \mapsto \langle v_w, \psi(x) \rangle$ belongs to $H_B$ and is the same as the predictor $x \mapsto p(\langle w, x \rangle)$.
This proves that $H_p \subset H_B$ for all $p \in P_B$ as required. Finally, if $\mathcal{X}$ is an infinite dimensional RKHS,
the only technicality is that in order to represent $x$ as a (possibly infinite) vector, we need to show
that our RKHS has a countable basis. This holds since the inner product $\langle x, x' \rangle$ over $\mathcal{X}$ is continuous
and bounded (see [1]). ■

² In fact, using stochastic gradient descent, we can $(\epsilon, \delta)$-learn $H_B$ in time $O(m^2)$, where $m$ is as defined
in Theorem 3. See for example [8, 22].
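The three identities used in the proof can be checked numerically in a small dimension by building the feature map explicitly up to a finite degree. The sketch below is such a sanity check under our own naming (`psi`, `v_w`, `dot`); truncating at `max_degree` is an assumption made only so that the vectors are finite.

```python
import itertools
import math

def psi(x, max_degree):
    """Truncated feature map: the entry at index (k_1..k_j) is 2^(-j/2) x_{k_1} ... x_{k_j}."""
    n = len(x)
    return [
        2.0 ** (-j / 2.0) * math.prod(x[k] for k in idx)
        for j in range(max_degree + 1)
        for idx in itertools.product(range(n), repeat=j)
    ]

def v_w(w, betas):
    """Weight vector of p(a) = sum_j betas[j] a^j: entries beta_j 2^(j/2) w_{k_1} ... w_{k_j}."""
    n = len(w)
    return [
        betas[j] * 2.0 ** (j / 2.0) * math.prod(w[k] for k in idx)
        for j in range(len(betas))
        for idx in itertools.product(range(n), repeat=j)
    ]

def dot(u, v):
    """Plain Euclidean inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))
```

For a degree-3 polynomial with coefficients `betas`, `dot(v_w(w, betas), psi(x, 3))` should equal $p(\langle w, x \rangle)$, and `dot(v_w(w, betas), v_w(w, betas))` should equal $\sum_j \beta_j^2 2^j \|w\|^{2j}$, up to floating-point error.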

Finally, the following lemma states that with a sufficiently large $B$, there exists a polynomial in
$P_B$ which approximately equals $\phi_{\mathrm{sig}}$. This implies that $H_B$ approximately contains $H_{\mathrm{sig}}$.

Lemma 2 Let $\phi_{\mathrm{sig}}$ be as defined in Equation (3), where for simplicity we assume $L \ge 3$. For any
$\epsilon > 0$, let

$$ B = 2L^4 + \exp\big( 7L \log(\tfrac{2L}{\epsilon}) + 3 \big) . $$

Then there exists $p \in P_B$ such that

$$ \forall x, w \in \mathcal{X}, \quad |p(\langle w, x \rangle) - \phi_{\mathrm{sig}}(\langle w, x \rangle)| \le \epsilon . $$

The proof of the lemma is based on a Chebyshev approximation technique and is given in
Appendix B. Since the proof is rather involved, we also present a similar lemma, whose proof is
simpler, for the $\phi_{\mathrm{erf}}$ transfer function (see Appendix C). It is interesting to note that $\phi_{\mathrm{erf}}$ actually
belongs to $P_B$ for a sufficiently large $B$, since it can be defined via its infinite-degree Taylor
expansion. However, the bound for $\phi_{\mathrm{erf}}$ depends on $\exp(L^2)$, rather than $\exp(L)$ for the sigmoid transfer
function $\phi_{\mathrm{sig}}$.
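The remark about $\phi_{\mathrm{erf}}$ can be explored numerically. Using the standard Maclaurin series $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \sum_{n \ge 0} \frac{(-1)^n z^{2n+1}}{n!\,(2n+1)}$, the coefficients of $\phi_{\mathrm{erf}}$ are $\beta_0 = 1/2$ and $\beta_{2n+1} = \frac{(-1)^n (\sqrt{\pi} L)^{2n+1}}{\sqrt{\pi}\, n!\, (2n+1)}$. The sketch below evaluates the logarithm of $\sum_j \beta_j^2 2^j$, the quantity constrained in Equation (8). The function names and truncation lengths are our choices, and this is only an illustration of the growth in $L$, not the bound proved in Appendix C.

```python
import math

def erf_taylor_coefficients(L, num_terms):
    """Coefficients beta_j of phi_erf(a) = (1 + erf(sqrt(pi) L a)) / 2 up to degree 2*num_terms - 1."""
    betas = [0.0] * (2 * num_terms)
    betas[0] = 0.5
    c = math.sqrt(math.pi) * L
    for n in range(num_terms):
        j = 2 * n + 1
        betas[j] = (-1) ** n * c ** j / (math.sqrt(math.pi) * math.factorial(n) * j)
    return betas

def log_weighted_norm_erf(L, num_terms=2000):
    """log of sum_j beta_j^2 2^j for phi_erf, computed in the log domain to avoid overflow."""
    q = 2.0 * math.pi * L * L
    logs = [math.log(0.25)]  # beta_0^2 * 2^0
    for n in range(num_terms):
        j = 2 * n + 1
        # beta_j^2 2^j = q^j / (pi (n!)^2 j^2)
        logs.append(j * math.log(q) - math.log(math.pi)
                    - 2.0 * math.lgamma(n + 1) - 2.0 * math.log(j))
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))
```

Comparing `log_weighted_norm_erf(L)` for a few values of `L` shows the logarithm growing roughly quadratically in `L`, consistent with the $\exp(L^2)$ dependence mentioned above.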

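Finally, Theorem 2 is obtained as follows: Combining Theorem 3 and Lemma 1 we get that with
probability of at least $1 - \delta$,

$$ \mathrm{err}_{\mathcal{D}}(\hat{h}) \le \min_{h \in H_B} \mathrm{err}_{\mathcal{D}}(h) + \epsilon/2 \le \min_{p \in P_B} \min_{h \in H_p} \mathrm{err}_{\mathcal{D}}(h) + \epsilon/2 . \tag{9} $$

From Lemma 2 we obtain that for any $w \in \mathcal{X}$, if $h(x) = \phi_{\mathrm{sig}}(\langle w, x \rangle)$ then there exists a polynomial
$p_0 \in P_B$ such that if $h'(x) = p_0(\langle w, x \rangle)$ then $\mathrm{err}_{\mathcal{D}}(h') \le \mathrm{err}_{\mathcal{D}}(h) + \epsilon/2$. Since it holds for all $w$, we
get that

$$ \min_{p \in P_B} \min_{h \in H_p} \mathrm{err}_{\mathcal{D}}(h) \le \min_{h \in H_{\mathrm{sig}}} \mathrm{err}_{\mathcal{D}}(h) + \epsilon/2 . $$

Combining this with Equation (9), Theorem 2 follows.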

3 Hardness 

In this section we derive a hardness result for agnostic learning of $H_{\mathrm{sig}}$ or $H_{\mathrm{pw}}$ with respect to the
zero-one loss. The hardness result relies on the hardness of standard (non-agnostic)³ PAC learning of
intersection of halfspaces given in Klivans and Sherstov [17] (see also similar arguments in [12]). The
hardness result is representation-independent: it makes no restrictions on the learning algorithm
and in particular also holds for improper learning algorithms. The hardness result is based on the
following cryptographic assumption:

³ In the standard PAC model, we assume that some hypothesis in the class has $\mathrm{err}_{\mathcal{D}}(h) = 0$, while in the
agnostic PAC model, which we study in this paper, $\mathrm{err}_{\mathcal{D}}(h)$ might be strictly greater than zero for all $h \in H$.
Note that our definition of $(\epsilon, \delta)$-learning in this paper is in the agnostic model.



Assumption 1 There is no polynomial time solution to the $\tilde{O}(n^{1.5})$-unique-Shortest-Vector-Problem.

In a nutshell, given a basis $v_1, \ldots, v_n \in \mathbb{R}^n$, the $\tilde{O}(n^{1.5})$-unique-Shortest-Vector-Problem consists
of finding the shortest nonzero vector in $\{a_1 v_1 + \ldots + a_n v_n : a_1, \ldots, a_n \in \mathbb{Z}\}$, even given the
information that it is shorter by a factor of at least $\tilde{O}(n^{1.5})$ than any other non-parallel vector. This
problem is believed to be hard: there are no known sub-exponential algorithms, and it is known to
be NP-hard if $\tilde{O}(n^{1.5})$ is replaced by a small constant (see [17] for more details).
With this assumption, Klivans and Sherstov proved the following:

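Theorem 4 (Theorem 1.2 in Klivans and Sherstov [17]) Let $\mathcal{X} = \{\pm 1\}^n$, let

$$ H = \{ x \mapsto \phi_{0-1}(\langle w, x \rangle - \theta - 1/2) : \theta \in \mathbb{N},\ w \in \mathbb{N}^n,\ |\theta| + \|w\|_1 \le \mathrm{poly}(n) \} , $$

and let $H_k = \{ x \mapsto (h_1(x) \wedge \ldots \wedge h_k(x)) : \forall i,\ h_i \in H \}$. Then, based on Assumption 1, $H_k$ is not
efficiently learnable in the standard PAC model for any $k = n^\rho$ where $\rho > 0$ is a constant.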

The above theorem implies the following. 

Lemma 3 Based on Assumption 1, there is no algorithm that runs in time $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$ and
$(\epsilon, \delta)$-learns the class $H$ defined in Theorem 4.

Proof To prove the lemma we show that if there is a polynomial time algorithm that learns $H$
in the agnostic model, then there exists a weak learning algorithm (with a polynomial edge) that
learns $H_k$ in the standard (non-agnostic) PAC model. In the standard PAC model, weak learning
implies strong learning [20], hence the existence of a weak learning algorithm that learns $H_k$ will
contradict Theorem 4.

Indeed, let $\mathcal{D}$ be any distribution such that there exists $h^\star \in H_k$ with $\mathrm{err}_{\mathcal{D}}(h^\star) = 0$. Let us
rewrite $h^\star = h^\star_1 \wedge \ldots \wedge h^\star_k$ where for all $i$, $h^\star_i \in H$. To show that there exists a weak learner, we first
show that there exists some $h \in H$ with $\mathrm{err}_{\mathcal{D}}(h) \le 1/2 - 1/(2k^2)$.

Since for each $x$, if $h^\star(x) = 0$ then there exists $j$ s.t. $h^\star_j(x) = 0$, we can use the union bound to
get that

$$ 1 = \mathbb{P}[\exists j : h^\star_j(x) = 0 \mid h^\star(x) = 0] \le \sum_{j} \mathbb{P}[h^\star_j(x) = 0 \mid h^\star(x) = 0] \le k \max_{j} \mathbb{P}[h^\star_j(x) = 0 \mid h^\star(x) = 0] . $$

So, for $j$ that maximizes $\mathbb{P}[h^\star_j(x) = 0 \mid h^\star(x) = 0]$ we get that $\mathbb{P}[h^\star_j(x) = 0 \mid h^\star(x) = 0] \ge 1/k$.
Therefore,

$$ \mathrm{err}_{\mathcal{D}}(h^\star_j) = \mathbb{P}[h^\star_j(x) = 1 \wedge h^\star(x) = 0] = \mathbb{P}[h^\star(x) = 0]\ \mathbb{P}[h^\star_j(x) = 1 \mid h^\star(x) = 0] $$
$$ = \mathbb{P}[h^\star(x) = 0]\ \big(1 - \mathbb{P}[h^\star_j(x) = 0 \mid h^\star(x) = 0]\big) \le \mathbb{P}[h^\star(x) = 0]\ (1 - 1/k) . $$

Now, if $\mathbb{P}[h^\star(x) = 0] \le 1/2 + 1/k^2$ then the above gives

$$ \mathrm{err}_{\mathcal{D}}(h^\star_j) \le (1/2 + 1/k^2)(1 - 1/k) \le 1/2 - 1/(2k^2) , $$

where the inequality holds for any positive integer $k$. Otherwise, if $\mathbb{P}[h^\star(x) = 0] > 1/2 + 1/k^2$, then
the constant predictor $h(x) = 0$ has $\mathrm{err}_{\mathcal{D}}(h) < 1/2 - 1/k^2$. In both cases we have shown that there
exists a predictor in $H$ with error of at most $1/2 - 1/(2k^2)$.

Finally, if we can agnostically learn $H$ in time $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$, then we can find $h'$ with
$\mathrm{err}_{\mathcal{D}}(h') \le \min_{h \in H} \mathrm{err}_{\mathcal{D}}(h) + \epsilon \le 1/2 - 1/(2k^2) + \epsilon$ in time $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$ (recall that $k = n^\rho$
for some $\rho > 0$). Taking, for example, $\epsilon = 1/(4k^2)$ leaves an edge of $1/(4k^2)$, which is polynomial in $1/n$.
This means that we can have a weak learner that runs in polynomial time, and
this concludes our proof. ■
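The elementary inequality $(1/2 + 1/k^2)(1 - 1/k) \le 1/2 - 1/(2k^2)$ used in the proof can be verified exactly with rational arithmetic. The helper below (its name is ours) returns the slack between the two sides; expanding the product shows that this slack equals $(k-1)(k-2)/(2k^3)$, which is non-negative for every positive integer $k$ and vanishes at $k = 1$ and $k = 2$.

```python
from fractions import Fraction

def weak_learning_gap(k):
    """Exact value of (1/2 - 1/(2k^2)) - (1/2 + 1/k^2)(1 - 1/k) for a positive integer k."""
    k = Fraction(k)
    lhs = (Fraction(1, 2) + 1 / k ** 2) * (1 - 1 / k)
    rhs = Fraction(1, 2) - 1 / (2 * k ** 2)
    return rhs - lhs
```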

Let $h$ be a hypothesis in the class $H$ defined in Theorem 4 and take any $x \in \{\pm 1\}^n$. Then,
there exist an integer $\theta$ and a vector of integers $w$ such that $h(x) = \phi_{0-1}(\langle w, x \rangle - \theta - 1/2)$. But
since $\langle w, x \rangle - \theta$ is also an integer, if we let $L = 1$ this means that $h(x) = \phi_{\mathrm{pw}}(\langle w, x \rangle - \theta - 1/2)$
as well. Furthermore, letting $x' \in \mathbb{R}^{n+1}$ denote the concatenation of $x$ with the constant 1 and
letting $w' \in \mathbb{R}^{n+1}$ denote the concatenation of $w$ with the scalar $(-\theta - 1/2)$ we obtain that $h(x) =
\phi_{\mathrm{pw}}(\langle w', x' \rangle)$. Last, if we normalize $\tilde{w} = w'/\|w'\|$, $\tilde{x} = x'/\|x'\|$, and redefine $L$ to be $\|w'\|\,\|x'\|$, we
get that $h(x) = \phi_{\mathrm{pw}}(\langle \tilde{w}, \tilde{x} \rangle)$. That is, we have shown that $H$ is contained in a class of the form $H_{\mathrm{pw}}$
with a Lipschitz constant bounded by $\mathrm{poly}(n)$. Combining the above with Lemma 3 we obtain the
following:



Corollary 2 Let $L$ be a Lipschitz constant and let $H_{\mathrm{pw}}$ be the class defined by the $L$-Lipschitz
transfer function $\phi_{\mathrm{pw}}$. Then, based on Assumption 1, there is no algorithm that runs in time
$\mathrm{poly}(L, 1/\epsilon, 1/\delta)$ and $(\epsilon, \delta)$-learns the class $H_{\mathrm{pw}}$.
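The embedding of an integer-weight halfspace into $H_{\mathrm{pw}}$ can be checked mechanically on small instances. In the sketch below, `normalized_form` (a name we introduce) appends the bias coordinate, normalizes both vectors, and returns the Lipschitz constant $L = \|w'\|\,\|x'\|$; the claim being illustrated is that $\phi_{\mathrm{pw}}$ with this $L$ reproduces the 0-1 halfspace on every point of $\{\pm 1\}^n$.

```python
import math

def phi_01(a):
    """0-1 transfer function: 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def phi_pw(a, L):
    """Piece-wise linear transfer function max(min(1/2 + L a, 1), 0)."""
    return max(min(0.5 + L * a, 1.0), 0.0)

def normalized_form(w, theta, x):
    """Return (w_tilde, x_tilde, L) such that L * <w_tilde, x_tilde> = <w, x> - theta - 1/2."""
    w_aug = list(w) + [-theta - 0.5]   # w' = (w, -theta - 1/2)
    x_aug = list(x) + [1.0]            # x' = (x, 1)
    norm_w = math.sqrt(sum(a * a for a in w_aug))
    norm_x = math.sqrt(sum(a * a for a in x_aug))
    w_tilde = [a / norm_w for a in w_aug]
    x_tilde = [a / norm_x for a in x_aug]
    return w_tilde, x_tilde, norm_w * norm_x
```

Because $\langle w, x \rangle - \theta - 1/2$ is always a half-integer of absolute value at least $1/2$, the piece-wise linear function is already saturated at 0 or 1, which is why the two predictors coincide.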

A similar argument leads to the hardness of learning $H_{\mathrm{sig}}$.

Theorem 5 Let $L$ be a Lipschitz constant and let $H_{\mathrm{sig}}$ be the class defined by the $L$-Lipschitz transfer
function $\phi_{\mathrm{sig}}$. Then, based on Assumption 1, there is no algorithm that runs in time $\mathrm{poly}(L, 1/\epsilon, 1/\delta)$
and $(\epsilon, \delta)$-learns the class $H_{\mathrm{sig}}$.

Proof Let $h$ be a hypothesis in the class $H$ defined in Theorem 4 and take any $x \in \{\pm 1\}^n$. Then,
there exist an integer $\theta$ and a vector of integers $w$ such that $h(x) = \phi_{0-1}(\langle w, x \rangle - \theta - 1/2)$. However,
since $\langle w, x \rangle - \theta$ is also an integer, we see that

$$ \big| \phi_{0-1}(\langle w, x \rangle - \theta - 1/2) - \phi_{\mathrm{sig}}(\langle w, x \rangle - \theta - 1/2) \big| \le \frac{1}{1 + \exp(2L)} . $$

This means that for any $\epsilon > 0$, if we pick $L = \frac{\log(2/\epsilon - 1)}{2}$ and define $h_{\mathrm{sig}}(x) = \phi_{\mathrm{sig}}(\langle w, x \rangle - \theta - 1/2)$,
then $|h(x) - h_{\mathrm{sig}}(x)| \le \epsilon/2$. Furthermore, letting $x' \in \mathbb{R}^{n+1}$ denote the concatenation of $x$ with
the constant 1 and letting $w' \in \mathbb{R}^{n+1}$ denote the concatenation of $w$ with the scalar $(-\theta - 1/2)$ we
obtain that $h_{\mathrm{sig}}(x) = \phi_{\mathrm{sig}}(\langle w', x' \rangle)$. Last, let us normalize $\tilde{w} = w'/\|w'\|$, $\tilde{x} = x'/\|x'\|$, and redefine
$L$ to be

$$ L = \frac{\|w'\|\,\|x'\| \log(2/\epsilon - 1)}{2} \tag{10} $$

so that $h_{\mathrm{sig}}(x) = \phi_{\mathrm{sig}}(\langle \tilde{w}, \tilde{x} \rangle)$. Thus we see that if there exists an algorithm that runs in time
$\mathrm{poly}(L, 1/\epsilon, 1/\delta)$ and $(\epsilon/2, \delta)$-learns the class $H_{\mathrm{sig}}$, then since for all $h \in H$ there exists $h_{\mathrm{sig}} \in H_{\mathrm{sig}}$ such
that $|h_{\mathrm{sig}}(x) - h(x)| \le \epsilon/2$, there also exists an algorithm that $(\epsilon, \delta)$-learns the concept class $H$
defined in Theorem 4 in time polynomial in $(L, 1/\epsilon, 1/\delta)$ (for $L$ defined in Equation (10)). But by
definition of $L$ in Equation (10) and the fact that $\|w'\|$ and $\|x'\|$ are of size $\mathrm{poly}(n)$, this means that
there is an algorithm that runs in time polynomial in $(n, 1/\epsilon, 1/\delta)$ and $(\epsilon, \delta)$-learns the class $H$,
which contradicts Lemma 3. ■
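The choice $L = \log(2/\epsilon - 1)/2$ in the proof is exactly the value for which $1/(1 + \exp(2L)) = \epsilon/2$. The following sketch checks this on half-integer arguments, using the sigmoid $1/(1 + \exp(-4La))$ of Equation (3). The function names and the finite range of half-integers are our assumptions, made to keep the check small and free of overflow.

```python
import math

def phi_01(a):
    """0-1 transfer function: 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def phi_sig(a, L):
    """Sigmoid transfer function 1 / (1 + exp(-4 L a))."""
    return 1.0 / (1.0 + math.exp(-4.0 * L * a))

def lipschitz_for_accuracy(eps):
    """L = log(2/eps - 1) / 2, so that 1 / (1 + exp(2 L)) = eps / 2 (requires 0 < eps < 1)."""
    return 0.5 * math.log(2.0 / eps - 1.0)

def worst_case_gap(L, max_int=5):
    """Largest |phi_01(a) - phi_sig(a)| over half-integers a = t - 1/2 with |t| <= max_int."""
    return max(
        abs(phi_01(t - 0.5) - phi_sig(t - 0.5, L))
        for t in range(-max_int, max_int + 1)
    )
```

The worst case is attained at $a = \pm 1/2$, the half-integers closest to the decision boundary, so `worst_case_gap(lipschitz_for_accuracy(eps))` should match `eps / 2` up to rounding.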



4 Related work 

The problem of learning kernel-based halfspaces has been extensively studied before, mainly in the
framework of SVM [26, 10, 21]. When the data is separable with a margin $\mu$, it is possible to learn
a halfspace in polynomial time. The learning problem becomes much more difficult when the data
is not separable with margin.

In terms of hardness results, [6] derive hardness results for proper learning with sufficiently small
margins. There are also strong hardness of approximation results for proper learning without margin
(see for example [13] and the references therein). We emphasize that we allow improper learning,
which is just as useful for the purpose of learning good classifiers, and thus these hardness results do
not apply. Instead, the hardness result we derived in Section 3 holds for improper learning as well.
As mentioned before, the main tool we rely on for deriving the hardness result is the representation
independent hardness result for learning intersections of halfspaces given in [17].

Practical algorithms such as SVM often replace the 0-1 error function with a convex surrogate,
and then apply convex optimization tools. However, there are no guarantees on how well the
surrogate function approximates the 0-1 error function. Recently, [28, 4] studied the asymptotic
relationship between surrogate convex loss functions and the 0-1 error function. In contrast, in this
paper we show that even with a finite sample, surrogate convex loss functions can be competitive
with the 0-1 error function as long as we replace inner-products with the kernel $K(x, x') = 1/(1 -
0.5 \langle x, x' \rangle)$.

4.1 Margin analysis 

Recall that we circumvented the dependence of the VC dimension of $H_{\phi_{0-1}}$ on the dimensionality
of $\mathcal{X}$ by replacing $\phi_{0-1}$ with a Lipschitz transfer function. Another common approach is to require
that the learned classifier will be competitive with the margin error rate of the optimal halfspace.
Formally, the $\mu$-margin error rate of a halfspace of the form $h_w(x) = \mathbb{1}(\langle w, x \rangle > 0)$ is defined as:

$$ \mathrm{err}_{\mathcal{D},\mu}(w) = \Pr[\, h_w(x) \ne y \ \vee\ |\langle w, x \rangle| \le \mu \,] . \tag{11} $$

Intuitively, $\mathrm{err}_{\mathcal{D},\mu}(w)$ is the error rate of $h_w$ had we $\mu$-shifted each point in the worst possible way.
Margin based analysis restates the goal of the learner (as given in Equation (2)) and requires that
the learner will find a classifier $h$ that satisfies:

$$ \mathrm{err}_{\mathcal{D}}(h) \le \min_{w : \|w\| = 1} \mathrm{err}_{\mathcal{D},\mu}(w) + \epsilon . \tag{12} $$

Bounds of the above form are called margin-based bounds and are widely used in the statistical
analysis of Support Vector Machines and AdaBoost. It was shown [2, 19] that $m = \Theta(\log(1/\delta)/(\mu\epsilon)^2)$
examples are sufficient (and necessary) to learn a classifier for which Equation (12) holds with
probability of at least $1 - \delta$. Note that as in the sample complexity bound we gave in Theorem 1, the
margin based sample complexity bound also does not depend on the dimension.

In fact, the Lipschitz approach used in this paper and the margin-based approach are closely
related. First, it is easy to verify that if we set $L = 1/(2\mu)$, then for any $w$ the hypothesis
$h(x) = \phi_{\mathrm{pw}}(\langle w, x \rangle)$ satisfies $\mathrm{err}_{\mathcal{D}}(h) \le \mathrm{err}_{\mathcal{D},\mu}(w)$. Therefore, an algorithm that $(\epsilon, \delta)$-learns $H_{\mathrm{pw}}$ also
guarantees that Equation (12) holds. Second, it is also easy to verify that if we set $L = \frac{1}{4\mu} \log\big(\frac{2 - \epsilon}{\epsilon}\big)$
then for any $w$ the hypothesis $h(x) = \phi_{\mathrm{sig}}(\langle w, x \rangle)$ satisfies $\mathrm{err}_{\mathcal{D}}(h) \le \mathrm{err}_{\mathcal{D},\mu}(w) + \epsilon/2$. Therefore,
an algorithm that $(\epsilon/2, \delta)$-learns $H_{\mathrm{sig}}$ also guarantees that Equation (12) holds.
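Both claims hold pointwise, which makes them easy to sanity-check: for a single example with score $a = \langle w, x \rangle$ and label $y$, the expected loss $|\phi(a) - y|$ of the randomized predictor is compared with the indicator of a margin error. The sketch below does this for the two choices of $L$ stated above; the function names are ours and the code is only an illustration of the argument.

```python
import math

def margin_error_indicator(a, y, mu):
    """1 if the halfspace errs on (a, y) or the score a falls within the margin mu, else 0."""
    prediction = 1 if a > 0 else 0
    return 1.0 if (prediction != y or abs(a) <= mu) else 0.0

def pw_loss(a, y, mu):
    """Expected loss |phi_pw(a) - y| with L = 1 / (2 mu)."""
    L = 1.0 / (2.0 * mu)
    return abs(max(min(0.5 + L * a, 1.0), 0.0) - y)

def sig_loss(a, y, mu, eps):
    """Expected loss |phi_sig(a) - y| with L = log((2 - eps) / eps) / (4 mu)."""
    L = math.log((2.0 - eps) / eps) / (4.0 * mu)
    return abs(1.0 / (1.0 + math.exp(-4.0 * L * a)) - y)
```

Taking expectations over the distribution then yields the two inequalities between $\mathrm{err}_{\mathcal{D}}(h)$ and $\mathrm{err}_{\mathcal{D},\mu}(w)$.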

As a direct corollary of the above discussion we obtain that it is possible to learn a vector $w$
that guarantees Equation (12) in time $\mathrm{poly}(\exp(\tilde{O}(1/\mu)))$.

A computational complexity analysis under margin assumptions was first carried out in [6] (see
also the hierarchical worst-case analysis recently proposed in [5]). The technique used in [6] is based
on the observation that in the noise-free case, an optimal halfspace can be expressed as a linear sum
of at most $1/\mu^2$ examples. Therefore, one can perform an exhaustive search over all sub-sequences of
$1/\mu^2$ examples, and choose the optimal halfspace. Note that this algorithm will always run in time
$m^{1/\mu^2}$. Since the sample complexity bound requires that $m$ will be of order $1/(\mu\epsilon)^2$, the runtime of
the method described by [6] becomes $\mathrm{poly}(\exp(\tilde{O}(1/\mu^2)))$. In comparison, our algorithm achieves a
better runtime of $\mathrm{poly}(\exp(\tilde{O}(1/\mu)))$. Moreover, while the algorithm of [6] performs an exhaustive
search, our algorithm's runtime depends on the parameter $B$, which is $\mathrm{poly}(\exp(\tilde{O}(1/\mu)))$ only under
a worst-case assumption. Since in practice we will cross-validate for $B$, it is plausible that in many
real-world scenarios the runtime of our algorithm will be much smaller.

4.2 Distributional Assumptions 

The idea of approximating the zero-one transfer function with a polynomial was first proposed
by [15], who studied the problem of agnostically learning halfspaces without kernels in $\mathbb{R}^n$ under
a distributional assumption. In particular, they showed that if the distribution over $\mathcal{X}$ is uniform over
the unit ball, then it is possible to agnostically learn $H_{\phi_{0-1}}$ in time $\mathrm{poly}(n^{1/\epsilon^4})$. This was further
generalized by [7], who showed that similar bounds hold for product distributions.

Besides distributional assumptions, these works are characterized by explicit dependence on the
dimension of $\mathcal{X}$, and therefore are not adequate for the kernel-based setting we consider in this paper,
in which the dimensionality of $\mathcal{X}$ can even be infinite. More precisely, while [15] try to approximate
the zero-one transfer function with a low-degree polynomial, we require instead that the coefficients
of the polynomials are bounded. The principle that when learning in high dimensions "the size of
the parameters is more important than their number" was one of the main advantages in the analysis
of the statistical properties of several learning algorithms (e.g. [2]).

Interestingly, in [23] we show that the very same algorithm we use in this paper recovers the same
complexity bound of [15].

5 Discussion 

In this paper we described and analyzed a new technique for agnostically learning kernel-based 
halfspaces with the zero-one loss function. The bound we derive has an exponential dependence 
on L, the Lipschitz coefficient of the transfer function. While we prove that (under a certain 
cryptographic assumption) no algorithm can have a polynomial dependence on L, the immediate 
open question is whether the dependence on L can be further improved. 

A perhaps surprising property of our analysis is that we propose a single algorithm, returning
a single classifier, which is simultaneously competitive against all transfer functions $p \in P_B$. In
particular, it learns with respect to the "optimal" transfer function, where by optimal we mean the
one which attains the smallest error rate, $\mathbb{E}[|p(\langle w, x \rangle) - y|]$, over the distribution $\mathcal{D}$.

Our algorithm boils down to linear regression with the absolute loss function, while composing
a particular kernel function over our original RKHS. It is possible to show that solving the vanilla
SVM, with the hinge-loss, and composing again our particular kernel over the desired kernel, can
also give similar guarantees. It is therefore interesting to study whether there is something special about
the kernel we propose, or whether other kernel functions (e.g. the Gaussian kernel) can give similar
guarantees.

Another possible direction is to consider other types of margin-based analysis or transfer func- 
tions. For example, in the statistical learning literature, there are several definitions of "noise" 
conditions, some of them are related to margin, which lead to faster decrease of the error rate as 
a function of the number of examples (see for example d, [2^ [13]). Studying the computational 
complexity of learning under these conditions is left to future work. 

Acknowledgments 

We would like to thank Adam Klivans for helping with the Hardness results. This work was partially 
supported by a Google Faculty Research Grant. 

References 

[1] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics.
Springer, 2003.

[2] P. L. Bartlett. For valid generalization, the size of the weights is more important than the size 
of the network. In Advances in Neural Information Processing Systems 9, 1997. 

[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and 
structural results. Journal of Machine Learning Research, 3:463-482, 2002. 

[4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. 
Journal of the American Statistical Association, 101:138-156, 2006. 

[5] S. Ben-David. Alternative measures of computational complexity. In TAMC, 2006. 

[6] S. Ben-David and H. Simon. Efficient learning of linear perceptrons. In NIPS, 2000. 

[7] E. Blais, R. O'Donnell, and K. Wimmer. Polynomial regression under arbitrary product
distributions. In COLT, 2008.

[8] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, pages 161-168, 
2008. 

[9] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis
of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.

[10] N. Cristianini and J. Shawe-Taylor. Kernel Methods for Pattern Analysis. Cambridge University
Press, 2004.

[11] D. Elliot. The evaluation and estimation of the coefficients in the Chebyshev series expansion
of a function. Mathematics of Computation, 18(86):274-284, April 1964.

[12] V. Feldman, P. Gopalan, S. Khot, and A.K. Ponnuswami. New results for learning noisy
parities and halfspaces. In Proceedings of the 47th Annual IEEE Symposium on Foundations
of Computer Science, 2006.

[13] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proceedings 
of the 47th Foundations of Computer Science (FOCS), 2006. 

[14] S.M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk 
bounds, margin bounds, and regularization. In NIPS, 2008. 

[15] A. Kalai, A.R. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. In 
Proceedings of the 46th Foundations of Computer Science (FOCS), 2005. 

[16] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In COLT,
pages 341-352, July 1992. To appear, Machine Learning.

[17] Adam R. Klivans and Alexander A. Sherstov. Cryptographic hardness for learning intersections 
of halfspaces. In FOCS, 2006. 



[18] J.C. Mason. Chebyshev Polynomials. CRC Press, 2003.

[19] D. A. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203-215, 2003.

[20] R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990. 

[21] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization,
Optimization and Beyond. MIT Press, 2002.

[22] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. 
In International Conference on Machine Learning, pages 928-935, 2008. 

[23] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Agnostically learning halfspaces with margin 
errors. Technical report, Toyota Technological Institute, 2009. 

[24] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels.
Annals of Statistics, 35:575, 2007.

[25] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32: 
135-166, 2004. 

[26] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. 

[27] G. Wahba. Spline Models for Observational Data. SIAM, 1990. 

[28] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk 
minimization. The Annals of Statistics, 32:56-85, 2004. 



A Solving the ERM problem given in Equation (5)

In this section we show how to approximately solve Equation (5) when the transfer function is $\phi_{\mathrm{pw}}$.
The technique we use is similar to the covering technique described in Q. 

For each $i$, let $b_i = 2(y_i - 1/2)$. It is easy to verify that the objective of Equation (5) can be
rewritten as

$$\frac{1}{m}\sum_{i=1}^{m} f(b_i\langle \mathbf{w}, \mathbf{x}_i\rangle) \quad\text{where}\quad f(a) = \min\{1, \max\{0, 1/2 - La\}\} . \tag{13}$$

Let $g(a) = \max\{0, 1/2 - La\}$. Note that $g$ is a convex function, $g(a) \ge f(a)$ for every $a$, and equality
holds whenever $a \ge -1/2L$.
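The relationship between $f$ and $g$ can be checked numerically. The following is a minimal sketch and not part of the original analysis; the value `L = 3.0` and the grid over $[-1, 1]$ are illustrative assumptions.

```python
def f(a: float, L: float) -> float:
    """Loss induced by the piecewise-linear transfer function: min{1, max{0, 1/2 - L*a}}."""
    return min(1.0, max(0.0, 0.5 - L * a))


def g(a: float, L: float) -> float:
    """Convex upper bound on f: max{0, 1/2 - L*a}."""
    return max(0.0, 0.5 - L * a)


L = 3.0  # illustrative Lipschitz constant (assumption)
grid = [i / 100.0 for i in range(-100, 101)]  # candidate values of b_i * <w, x_i>

# g dominates f everywhere on the grid
dominates = all(g(a, L) >= f(a, L) for a in grid)
# g and f coincide whenever a >= -1/(2L)
tight = all(g(a, L) == f(a, L) for a in grid if a >= -1.0 / (2.0 * L))
```

Both flags evaluate to `True`; the two functions differ only where $g$ exceeds 1, which is where $f$ is clipped.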

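Let $\mathbf{w}^*$ be a minimizer of Equation (13) over the unit ball. We partition the set $[m]$ into

$$I_1 = \{i \in [m] : g(b_i\langle \mathbf{w}^*, \mathbf{x}_i\rangle) = f(b_i\langle \mathbf{w}^*, \mathbf{x}_i\rangle)\} , \qquad I_2 = [m] \setminus I_1 .$$

Now, let $\hat{\mathbf{w}}$ be a vector that satisfies

$$\sum_{i \in I_1} g(b_i\langle \hat{\mathbf{w}}, \mathbf{x}_i\rangle) \le \min_{\mathbf{w}:\|\mathbf{w}\|\le 1} \sum_{i \in I_1} g(b_i\langle \mathbf{w}, \mathbf{x}_i\rangle) + \epsilon m . \tag{14}$$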

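Clearly, we have

$$\sum_{i=1}^{m} f(b_i\langle \hat{\mathbf{w}}, \mathbf{x}_i\rangle) \le \sum_{i \in I_1} g(b_i\langle \hat{\mathbf{w}}, \mathbf{x}_i\rangle) + \sum_{i \in I_2} f(b_i\langle \hat{\mathbf{w}}, \mathbf{x}_i\rangle)$$

$$\le \sum_{i \in I_1} g(b_i\langle \hat{\mathbf{w}}, \mathbf{x}_i\rangle) + |I_2|$$

$$\le \sum_{i \in I_1} g(b_i\langle \mathbf{w}^*, \mathbf{x}_i\rangle) + \epsilon m + |I_2|$$

$$= \sum_{i=1}^{m} f(b_i\langle \mathbf{w}^*, \mathbf{x}_i\rangle) + \epsilon m .$$

Dividing the two sides of the above by $m$ we obtain that $\hat{\mathbf{w}}$ is an $\epsilon$-accurate solution to Equation (13).
Therefore, it suffices to show a method that finds a vector $\hat{\mathbf{w}}$ that satisfies Equation (14). To do so,
we use a standard generalization bound (based on Rademacher complexity) as follows: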



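Lemma 4 Let us sample $i_1, \ldots, i_k$ i.i.d. according to the uniform distribution over $I_1$. Let $\hat{\mathbf{w}}$ be a
minimizer of $\sum_{j=1}^{k} g(b_{i_j}\langle \mathbf{w}, \mathbf{x}_{i_j}\rangle)$ over $\mathbf{w}$ in the unit ball. Then,

$$\mathbb{E}\left[\frac{1}{|I_1|}\sum_{i \in I_1} g(b_i\langle \hat{\mathbf{w}}, \mathbf{x}_i\rangle) - \min_{\mathbf{w}:\|\mathbf{w}\|\le 1} \frac{1}{|I_1|}\sum_{i \in I_1} g(b_i\langle \mathbf{w}, \mathbf{x}_i\rangle)\right] \le 2L/\sqrt{k} ,$$

where expectation is over the choice of $i_1, \ldots, i_k$.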



Proof Simply note that g is L-Lipschitz and then apply a Rademacher generalization bound with 
the contraction lemma. ■ 

The above lemma immediately implies that if $k \ge 4L^2/\epsilon^2$, then there exist $i_1, \ldots, i_k$ in $I_1$ such
that if $\hat{\mathbf{w}} \in \arg\min_{\mathbf{w}:\|\mathbf{w}\|\le 1} \sum_{j=1}^{k} g(b_{i_j}\langle \mathbf{w}, \mathbf{x}_{i_j}\rangle)$ then $\hat{\mathbf{w}}$ satisfies Equation (14) and therefore it is an
$\epsilon$-accurate solution of Equation (13). The procedure will therefore perform an exhaustive search over
all $i_1, \ldots, i_k$ in $[m]$; for each such sequence the procedure will find $\hat{\mathbf{w}} \in \arg\min_{\mathbf{w}:\|\mathbf{w}\|\le 1} \sum_{j=1}^{k} g(b_{i_j}\langle \mathbf{w}, \mathbf{x}_{i_j}\rangle)$
(in polynomial time). Finally, the procedure will output the $\hat{\mathbf{w}}$ that minimizes the objective of
Equation (13). The total runtime of the procedure is therefore $\mathrm{poly}(m^k)$. Plugging in the value of
$k = \lceil 4L^2/\epsilon^2 \rceil$ and the value of $m$ according to the sample complexity bound given in Theorem 1 we
obtain the total runtime of

$$\mathrm{poly}\left((L/\epsilon)^{L^2/\epsilon^2}\right) = \mathrm{poly}\left(\exp\left(\tfrac{L^2}{\epsilon^2}\log(L/\epsilon)\right)\right) .$$
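To get a feel for these quantities, the sketch below computes the subsample size $k$ and the logarithm of the number $m^k$ of index sequences that the exhaustive search enumerates. The function names and the sample size `m = 1000` are hypothetical choices for illustration; the actual $m$ is dictated by the sample complexity bound.

```python
import math


def subsample_size(L: float, eps: float) -> int:
    """k = ceil(4 L^2 / eps^2), which makes the Lemma 4 bound 2L/sqrt(k) at most eps."""
    return math.ceil(4.0 * L * L / (eps * eps))


def log_num_sequences(m: int, k: int) -> float:
    """Natural logarithm of m**k, the number of index sequences enumerated."""
    return k * math.log(m)


L, eps = 3.0, 0.5   # illustrative values (assumption)
m = 1000            # hypothetical sample size (assumption)
k = subsample_size(L, eps)
log_count = log_num_sequences(m, k)
```

Working with the logarithm avoids forming the astronomically large integer $m^k$ and makes the exponential dependence on $L^2/\epsilon^2$ directly visible.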



B Proof of Lemma 2

In order to approximate $\phi_{\mathrm{sig}}$ with a polynomial, we will use the technique of Chebyshev
approximation (cf. [18]). One can write any continuous function on $[-1,+1]$ as a Chebyshev expansion
$\sum_{n=0}^{\infty} \alpha_n T_n(\cdot)$, where each $T_n(\cdot)$ is a particular $n$-th degree polynomial denoted as the $n$-th
Chebyshev polynomial (of the first kind). These polynomials are defined as $T_0(x) = 1$, $T_1(x) = x$, and
then recursively via $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$. For any $n$ and any $x \in [-1,+1]$, $T_n(x)$ takes values in $[-1,+1]$. The
coefficients in the Chebyshev expansion of $\phi_{\mathrm{sig}}$ are equal to

$$\alpha_n = \frac{1 + \mathbb{1}(n > 0)}{\pi}\int_{x=-1}^{1} \frac{\phi_{\mathrm{sig}}(x)\,T_n(x)}{\sqrt{1-x^2}}\,dx . \tag{15}$$

Truncating the series after some threshold $n = N$ provides an $N$-th degree polynomial which
approximates the original function.
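The truncated expansion can be illustrated numerically. The sketch below is not part of the original proof: it approximates the coefficients $\alpha_n$ by Gauss-Chebyshev quadrature (a swapped-in numerical technique), takes $\phi_{\mathrm{sig}}(x) = (1 + e^{-4Lx})^{-1}$, and evaluates the truncated series with the three-term recurrence. The values `L = 3.0`, `N = 40` and the node count `M` are illustrative assumptions.

```python
import math


def phi_sig(x: float, L: float) -> float:
    """Sigmoid transfer function 1 / (1 + exp(-4 L x))."""
    return 1.0 / (1.0 + math.exp(-4.0 * L * x))


def cheb_coeffs(func, N: int, M: int = 2000) -> list:
    """Approximate alpha_0..alpha_N with M-node Gauss-Chebyshev quadrature."""
    thetas = [math.pi * (k + 0.5) / M for k in range(M)]
    vals = [func(math.cos(t)) for t in thetas]
    coeffs = []
    for n in range(N + 1):
        s = sum(v * math.cos(n * t) for v, t in zip(vals, thetas))
        # prefactor (1 + 1(n > 0)) / pi times quadrature weight pi / M
        coeffs.append((1.0 if n == 0 else 2.0) * s / M)
    return coeffs


def cheb_eval(coeffs: list, x: float) -> float:
    """Evaluate sum_n alpha_n T_n(x) using T_{n+1} = 2x T_n - T_{n-1}."""
    t_prev, t_cur = 1.0, x
    total = coeffs[0]
    for n in range(1, len(coeffs)):
        total += coeffs[n] * t_cur
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return total


L, N = 3.0, 40  # illustrative values (assumption)
coeffs = cheb_coeffs(lambda x: phi_sig(x, L), N)
max_err = max(abs(cheb_eval(coeffs, i / 50.0) - phi_sig(i / 50.0, L))
              for i in range(-50, 51))
```

Because $\phi_{\mathrm{sig}}(x) + \phi_{\mathrm{sig}}(-x) = 1$, the computed $\alpha_0$ is $1/2$ and the even-indexed coefficients with $n \ge 2$ vanish up to rounding.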

In order to obtain a bound on $B$, we need to understand the behavior of the coefficients in
the Chebyshev approximation. These are determined in turn by the behavior of $\alpha_n$ as well as the
coefficients of each Chebyshev polynomial $T_n(\cdot)$. The following two lemmas provide the necessary
bounds.

Lemma 5 For any $n > 1$, $|\alpha_n|$ in the Chebyshev expansion of $\phi_{\mathrm{sig}}$ on $[-1,+1]$ is upper bounded as
follows:

$$|\alpha_n| \le \frac{1/2L + 1/\pi}{(1 + \pi/4L)^n} .$$

Also, we have $|\alpha_0| \le 1$, $|\alpha_1| \le 2$.

Proof The coefficients $\alpha_n$, $n = 1, 2, \ldots$ in the Chebyshev series are given explicitly by

$$\alpha_n = \frac{2}{\pi}\int_{x=-1}^{1} \frac{\phi_{\mathrm{sig}}(x)\,T_n(x)}{\sqrt{1-x^2}}\,dx . \tag{16}$$

For $\alpha_0$, the same equality holds with $2/\pi$ replaced by $1/\pi$, so $\alpha_0$ equals

$$\frac{1}{\pi}\int_{x=-1}^{1} \frac{\phi_{\mathrm{sig}}(x)}{\sqrt{1-x^2}}\,dx ,$$

which by definition of $\phi_{\mathrm{sig}}(x)$, is at most $(1/\pi)\int_{x=-1}^{1} \left(\sqrt{1-x^2}\right)^{-1} dx = 1$. As for $\alpha_1$, it equals

$$\frac{2}{\pi}\int_{x=-1}^{1} \frac{\phi_{\mathrm{sig}}(x)\,x}{\sqrt{1-x^2}}\,dx ,$$

whose absolute value is at most $(2/\pi)\int_{x=-1}^{1} \left(\sqrt{1-x^2}\right)^{-1} dx = 2$.

To evaluate the integral in Equation (16) for general $n$ and $L$, we will need to use some tools
from complex analysis. The calculation follows [11], to which we refer the reader for justification of
the steps and further details*.

On the complex plane, the integral in Equation (16) can be viewed as a line integral over $[-1,+1]$.
Using properties of Chebyshev polynomials, this integral can be converted into a more general
complex-valued integral over an arbitrary closed curve $C$ on the complex plane which satisfies certain
regularity conditions:

$$\alpha_n = \frac{1}{\pi i}\oint_{C} \frac{\phi_{\mathrm{sig}}(z)\,dz}{\sqrt{z^2-1}\,\left(z \pm \sqrt{z^2-1}\right)^n} , \tag{17}$$

where the sign in $\pm$ is chosen so that $|z \pm \sqrt{z^2-1}| > 1$. In particular, for any parameter $\rho > 1$, the
set of points $z$ satisfying $|z \pm \sqrt{z^2-1}| = \rho$ forms an ellipse with foci at $z = \pm 1$, which grows larger
with $\rho$. Since we are free to choose $C$, we choose it as this ellipse
while letting $\rho \to \infty$.

To understand what happens when $\rho \to \infty$, we need to characterize the singularities of $\phi_{\mathrm{sig}}(z)$,
namely the points $z$ where $\phi_{\mathrm{sig}}(z)$ is not well defined. Recalling that $\phi_{\mathrm{sig}}(z) = (1 + e^{-4Lz})^{-1}$, we
see that the problematic points are $i(\pi + 2\pi k)/4L$ for any $k = 0, \pm 1, \pm 2, \ldots$, where the denominator
in $\phi_{\mathrm{sig}}(z)$ equals zero. Note that this forms a discrete set of isolated points - in other words, $\phi_{\mathrm{sig}}$ is
a meromorphic function. The fact that $\phi_{\mathrm{sig}}$ is 'well behaved' in this sense allows us to perform the
analysis below.

*We note that such calculations also appear in standard textbooks on the subject, but they are usually
carried out under asymptotic assumptions and disregarding coefficients which are important for our purposes.

The behavior of the function at its singularities is defined via the residue of the function at
each singularity $c$, which equals $\lim_{z \to c}(z - c)\phi_{\mathrm{sig}}(z)$ assuming the limit exists (in that case, the
singularity is called a simple pole, otherwise a higher order limit might be needed). In our case, the
residue for the singularity at $i\pi/4L$ equals

$$\lim_{z \to 0} \frac{z}{1 + e^{-(i\pi + 4Lz)}} = \lim_{z \to 0} \frac{z}{1 - e^{-4Lz}} = \lim_{z \to 0} \frac{1/4L}{e^{-4Lz}} = 1/4L ,$$

where we used l'Hôpital's rule to calculate the limit. The same residue also applies to all the other
singularities.

For points in the complex plane uniformly bounded away from these singularities, $|\phi_{\mathrm{sig}}(z)|$ is
bounded, and therefore it can be shown that the integral in Equation (17) will tend to zero as we
let $C$ become an arbitrarily large ellipse (not passing too close to any of the singularities) by taking
$\rho \to \infty$. However, as $\rho$ varies smoothly, the ellipse does cross over singularity points, and these
contribute to the integral. For meromorphic functions, with a discrete set of isolated singularities,
we can simply sum over all contributions, and it can be shown (see equation 10 in [11] and the
subsequent discussion) that

$$\alpha_n = -2\sum_{k=-\infty}^{\infty} \frac{r_k}{\sqrt{z_k^2-1}\,\left(z_k \pm \sqrt{z_k^2-1}\right)^n} ,$$

where $z_k$ is the singularity point $i(\pi + 2\pi k)/4L$ with corresponding residue $r_k$. Substituting the
results for our chosen function, we have

$$\alpha_n = -2\sum_{k=-\infty}^{\infty} \frac{1/4L}{\sqrt{(i(\pi + 2\pi k)/4L)^2 - 1}\,\left(i(\pi + 2\pi k)/4L \pm \sqrt{(i(\pi + 2\pi k)/4L)^2 - 1}\right)^n} .$$

A routine simplification leads to the following^:

$$\alpha_n = -2\sum_{k=-\infty}^{\infty} \frac{1/4L}{i^{n+1}\sqrt{((\pi + 2\pi k)/4L)^2 + 1}\,\left((\pi + 2\pi k)/4L \pm \sqrt{((\pi + 2\pi k)/4L)^2 + 1}\right)^n} .$$
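It can be verified that the sign in $\pm$ should be chosen according to $\mathbb{1}(k \ge 0)$. Therefore,

$$|\alpha_n| \le \sum_{k=-\infty}^{\infty} \frac{1/4L}{\sqrt{((\pi + 2\pi k)/4L)^2 + 1}\,\left(|\pi + 2\pi k|/4L + \sqrt{((\pi + 2\pi k)/4L)^2 + 1}\right)^n}$$

$$\le \sum_{k=-\infty}^{\infty} \frac{1/4L}{\left(|\pi + 2\pi k|/4L + 1\right)^n} \le \frac{1/4L}{(1 + \pi/4L)^n} + 2\sum_{k=1}^{\infty} \frac{1/4L}{(1 + \pi(1 + 2k)/4L)^n}$$

$$\le \frac{1/4L}{(1 + \pi/4L)^n} + \int_{k=0}^{\infty} \frac{1/2L}{(1 + \pi(1 + 2k)/4L)^n}\,dk .$$

Solving the integral and simplifying gives us

$$|\alpha_n| \le \frac{1}{(1 + \pi/4L)^n}\left(\frac{1}{4L} + \frac{1 + \pi/4L}{\pi(n - 1)}\right) .$$

Since $n \ge 2$, the result in the lemma follows.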



^On first look, it might appear that $\alpha_n$ takes imaginary values for even $n$, due to the $i^{n+1}$ factor, despite
$\alpha_n$ being equal to a real-valued integral. However, it can be shown that $\alpha_n = 0$ for even $n$. This additional
analysis can also be used to slightly tighten our final results in terms of constants in the exponent, but it
was not included for simplicity.



Lemma 6 For any non-negative integer $n$ and $j = 0, 1, \ldots, n$, let $t_{n,j}$ be the coefficient of $x^j$ in
$T_n(x)$. Then $t_{n,j} = 0$ for any $j$ with a different parity than $n$, and for any $j > 0$,

$$|t_{n,j}| \le \frac{e^{n+j}}{\sqrt{2\pi}} .$$

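Proof The fact that $t_{n,j} = 0$ for $j, n$ with different parities, and $|t_{n,0}| \le 1$ is standard. Using an
explicit formula from the literature (see [18], pg. 24), as well as Stirling approximation, we have
that

$$|t_{n,j}| = \frac{2^j n}{n + j}\binom{\frac{n+j}{2}}{\frac{n-j}{2}} \le \frac{n(n + j)^j}{(n + j)\,j!} \le \frac{n(n + j)^j}{(n + j)\sqrt{2\pi j}\,(j/e)^j} = \frac{n e^j}{(n + j)\sqrt{2\pi j}}\left(1 + \frac{n}{j}\right)^j \le \frac{n e^{n+j}}{(n + j)\sqrt{2\pi j}} ,$$

from which the lemma follows.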
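The statement of Lemma 6 can be sanity-checked on small cases. The sketch below is only an illustration, not part of the proof: it builds the coefficient table of $T_n$ from the recurrence $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$ using exact integer arithmetic and compares it with the parity claim and the bound $e^{n+j}/\sqrt{2\pi}$. The function name and the cutoff `n_max = 20` are assumptions.

```python
import math


def cheb_poly_coeffs(n_max: int) -> list:
    """Return rows[n][j] = coefficient of x**j in T_n, for n = 0..n_max."""
    rows = [[1], [0, 1]]
    for n in range(1, n_max):
        prev, cur = rows[n - 1], rows[n]
        nxt = [0] * (n + 2)
        for j, c in enumerate(cur):
            nxt[j + 1] += 2 * c   # contribution of 2x * T_n
        for j, c in enumerate(prev):
            nxt[j] -= c           # contribution of -T_{n-1}
        rows.append(nxt)
    return rows[: n_max + 1]


rows = cheb_poly_coeffs(20)

# coefficients vanish when j and n have different parity
parity_ok = all(rows[n][j] == 0
                for n in range(21) for j in range(n + 1) if (n - j) % 2 == 1)
# magnitude bound of Lemma 6 for j > 0
bound_ok = all(abs(rows[n][j]) <= math.exp(n + j) / math.sqrt(2.0 * math.pi)
               for n in range(1, 21) for j in range(1, n + 1))
```

For instance, `rows[3]` is `[0, -3, 0, 4]`, matching $T_3(x) = 4x^3 - 3x$.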



We are now in a position to prove a bound on $B$. As discussed earlier, $\phi_{\mathrm{sig}}(x)$ in the domain
$[-1,+1]$ equals the expansion $\sum_{n=0}^{\infty} \alpha_n T_n(x)$. The error resulting from truncating the Chebyshev
expansion at index $N$, for any $x \in [-1,+1]$, equals

$$\left|\phi_{\mathrm{sig}}(x) - \sum_{n=0}^{N} \alpha_n T_n(x)\right| = \left|\sum_{n=N+1}^{\infty} \alpha_n T_n(x)\right| \le \sum_{n=N+1}^{\infty} |\alpha_n| ,$$

where in the last transition we used the fact that $|T_n(x)| \le 1$. Using Lemma 5 and assuming $N > 0$,
this is at most

$$\sum_{n=N+1}^{\infty} \frac{1/2L + 1/\pi}{(1 + \pi/4L)^n} = \frac{2 + 4L/\pi}{\pi(1 + \pi/4L)^N} .$$


In order to achieve an accuracy of less than $\epsilon$ in the approximation, we need to equate this to $\epsilon$ and
solve for $N$, i.e.

$$N = \left\lceil \log_{1+\pi/4L}\left(\frac{2 + 4L/\pi}{\pi\epsilon}\right) \right\rceil . \tag{18}$$
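As a worked instance of Equation (18), the sketch below computes the truncation degree $N$ for given $L$ and $\epsilon$ and evaluates the tail bound $(2 + 4L/\pi)/(\pi(1 + \pi/4L)^N)$ at that degree. The function names and the sample values are illustrative assumptions.

```python
import math


def truncation_degree(L: float, eps: float) -> int:
    """N = ceil( log_{1 + pi/(4L)} ( (2 + 4L/pi) / (pi * eps) ) ), clipped at 0."""
    base = 1.0 + math.pi / (4.0 * L)
    target = (2.0 + 4.0 * L / math.pi) / (math.pi * eps)
    return max(0, math.ceil(math.log(target) / math.log(base)))


def tail_bound(L: float, N: int) -> float:
    """Upper bound on the truncation error after degree N."""
    base = 1.0 + math.pi / (4.0 * L)
    return (2.0 + 4.0 * L / math.pi) / (math.pi * base ** N)


L, eps = 3.0, 0.1   # illustrative values (assumption)
N = truncation_degree(L, eps)
err = tail_bound(L, N)
```

With these values the degree comes out as 13, and the tail bound at that degree is below $\epsilon$. Note that $N$ grows roughly like $(4L/\pi)\log(L/\epsilon)$, which is the source of the $L\log(L/\epsilon)$ term in the final exponent.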



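The series left after truncation is $\sum_{n=0}^{N} \alpha_n T_n(x)$, which we can write as a polynomial $\sum_{j=0}^{N} \beta_j x^j$. Using
Lemma 5 and Lemma 6, the absolute value of the coefficient $\beta_j$ for $j > 1$ can be upper bounded by

$$\sum_{\substack{n = j..N \\ n = j \bmod 2}} |\alpha_n||t_{n,j}| \le \sum_{\substack{n = j..N \\ n = j \bmod 2}} \frac{1/2L + 1/\pi}{(1 + \pi/4L)^n}\cdot\frac{e^{n+j}}{\sqrt{2\pi}}$$

$$= \frac{(1/2L + 1/\pi)\,e^j}{\sqrt{2\pi}} \sum_{\substack{n = j..N \\ n = j \bmod 2}} \left(\frac{e}{1 + \pi/4L}\right)^n$$

$$= \frac{(1/2L + 1/\pi)\,e^j}{\sqrt{2\pi}} \left(\frac{e}{1 + \pi/4L}\right)^j \sum_{n=0}^{\lfloor (N-j)/2 \rfloor} \left(\frac{e}{1 + \pi/4L}\right)^{2n}$$

$$\le \frac{(1/2L + 1/\pi)\,e^j}{\sqrt{2\pi}} \left(\frac{e}{1 + \pi/4L}\right)^j \frac{(e/(1 + \pi/4L))^{N-j+2} - 1}{(e/(1 + \pi/4L))^2 - 1} .$$

Since we assume $L \ge 3$, we have in particular $e/(1 + \pi/4L) > 1$, so we can upper bound the
expression above by dropping the 1 in the numerator, to get

$$\frac{(1/2L + 1/\pi)\,e^j}{\sqrt{2\pi}\left((e/(1 + \pi/4L))^2 - 1\right)} \left(\frac{e}{1 + \pi/4L}\right)^{N+2} .$$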



The cases $\beta_0, \beta_1$ need to be treated separately, due to the different form of the bounds on $\alpha_0, \alpha_1$.
Repeating a similar analysis (using the actual values of $t_{n,1}, t_{n,0}$ for any $n$), we get

$$\beta_0 \le 1 + \frac{1}{\pi} + \frac{2L}{\pi^2} ,$$

$$\beta_1 \le 2 + \frac{3(1 + 2L/\pi)(4L + \pi)}{2\pi^2} .$$

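Now that we got a bound on the $\beta_j$, we can plug it into the bound on $B$, and get

$$\sum_{j=0}^{N} 2^j\beta_j^2 \le \beta_0^2 + 2\beta_1^2 + \sum_{j=2}^{N} \left(\frac{1/2L + 1/\pi}{\sqrt{2\pi}\left((e/(1 + \pi/4L))^2 - 1\right)}\right)^2 \left(\frac{e}{1 + \pi/4L}\right)^{2N+4} (2e^2)^j$$

$$\le \beta_0^2 + 2\beta_1^2 + \left(\frac{1/2L + 1/\pi}{\sqrt{2\pi}\left((e/(1 + \pi/4L))^2 - 1\right)}\right)^2 \left(\frac{e}{1 + \pi/4L}\right)^{2N+4} \frac{(2e^2)^{N+1}}{2e^2 - 1}$$

$$= \beta_0^2 + 2\beta_1^2 + \frac{2(1/2L + 1/\pi)^2 e^6}{(2e^2 - 1)\,2\pi\left((e/(1 + \pi/4L))^2 - 1\right)^2 (1 + \pi/4L)^4} \left(\frac{\sqrt{2}\,e^2}{1 + \pi/4L}\right)^{2N} .$$

To make the expression more readable, we use the (rather arbitrary) assumption that $L \ge 3$. In
that case, by some numerical calculations, it is not difficult to show that we can upper bound the
above by

$$2L^4 + 0.15\left(\frac{\sqrt{2}\,e^2}{1 + \pi/4L}\right)^{2N} \le 2L^4 + 0.15\,(2e^4)^N .$$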



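Combining this with Equation (18), we get that this is upper bounded by

$$2L^4 + 0.15\,(2e^4)^{\log_{1+\pi/4L}\left(\frac{2 + 4L/\pi}{\pi\epsilon}\right) + 1} ,$$

or at most

$$2L^4 + \exp\left(\frac{\log(2e^4)\,\log\left(\frac{2 + 4L/\pi}{\pi\epsilon}\right)}{\log(1 + \pi/4L)} + 3\right) .$$

Using the fact that $\log(1 + x) \ge x - x^2/2$ for $x \ge 0$, and the assumption that $L \ge 3$, we can bound
the exponent by

$$\frac{\log(2e^4)\,\log\left(\frac{2 + 4L/\pi}{\pi\epsilon}\right)}{\frac{\pi}{4L}\left(1 - \frac{\pi}{8L}\right)} + 3 \le 7\log(2L/\epsilon)\,L + 3 .$$

Substituting back, we get the result stated in Lemma 2.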
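The last inequality can be checked numerically for particular parameter values. The sketch below is an illustration rather than a proof: it compares the exponent before simplification, $\log(2e^4)\log\left(\frac{2+4L/\pi}{\pi\epsilon}\right)/\log(1+\pi/4L) + 3$, with the simplified form $7L\log(2L/\epsilon) + 3$ on a small grid with $L \ge 3$ and $\epsilon \le 1/2$. The function names and the grid are assumptions.

```python
import math


def exponent_exact(L: float, eps: float) -> float:
    """log(2e^4) * log((2 + 4L/pi) / (pi*eps)) / log(1 + pi/(4L)) + 3."""
    num = math.log(2.0 * math.exp(4.0)) * math.log((2.0 + 4.0 * L / math.pi) / (math.pi * eps))
    return num / math.log(1.0 + math.pi / (4.0 * L)) + 3.0


def exponent_simplified(L: float, eps: float) -> float:
    """7 * L * log(2L / eps) + 3."""
    return 7.0 * L * math.log(2.0 * L / eps) + 3.0


# illustrative grid (assumption): L >= 3 and eps in (0, 1/2]
checks = [(L, eps, exponent_exact(L, eps), exponent_simplified(L, eps))
          for L in (3.0, 5.0, 20.0, 100.0)
          for eps in (0.5, 0.1, 0.001)]
all_hold = all(exact <= simple for _, _, exact, simple in checks)
```

On this grid `all_hold` is `True`; the constant 7 is tightest at $L = 3$, where the factor $\log(2e^4)\cdot(4/\pi)/(1 - \pi/8L)$ is roughly 6.9.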

C The $\phi_{\mathrm{erf}}(\cdot)$ Function

In this section, we prove a result analogous to Lemma 2 using the $\phi_{\mathrm{erf}}(\cdot)$ transfer function. In a
certain sense, it is stronger, because we can show that $\phi_{\mathrm{erf}}(\cdot)$ actually belongs to $P_B$ for sufficiently
large $B$. However, the resulting bound is worse than Lemma 2, as it depends on $\exp(L^2)$ rather
than $\exp(L)$. On the other hand, the proof is much simpler, which helps to illustrate the technique.
The relevant lemma is the following:

Lemma 7 Let $\phi_{\mathrm{erf}}(\cdot)$ be as defined earlier, where for simplicity we assume $L \ge 3$. For any
$\epsilon > 0$, let

$$B \le \frac{1}{4} + 2L^2\left(1 + 3\pi e L^2 e^{4\pi L^2}\right) .$$

Then $\phi_{\mathrm{erf}}(\cdot) \in P_B$.

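Proof By a standard fact, $\phi_{\mathrm{erf}}(\cdot)$ is equal to its infinite Taylor series expansion at any point, and
this series equals

$$\phi_{\mathrm{erf}}(a) = \frac{1}{2} + \frac{1}{\sqrt{\pi}}\sum_{n=0}^{\infty} \frac{(-1)^n(\sqrt{\pi}La)^{2n+1}}{n!\,(2n + 1)} .$$

Luckily, this is an infinite degree polynomial, and it is only left to calculate for which values of $B$
it belongs to $P_B$. Plugging in the coefficients in the bound on $B$, we get that

$$B \le \frac{1}{4} + \frac{1}{\pi}\sum_{n=0}^{\infty} \frac{(2\pi L^2)^{2n+1}}{(n!)^2(2n + 1)^2} \le \frac{1}{4} + \frac{1}{\pi}\sum_{n=0}^{\infty} \frac{(2\pi L^2)^{2n+1}}{(n!)^2}$$

$$= \frac{1}{4} + 2L^2\left(1 + \sum_{n=1}^{\infty} \frac{(2\pi L^2)^{2n}}{(n!)^2}\right) \le \frac{1}{4} + 2L^2\left(1 + \sum_{n=1}^{\infty} \left(\frac{2\pi e L^2}{n}\right)^{2n}\right) .$$

Thinking of $(2\pi e L^2/n)^{2n}$ as a continuous function of $n$, a simple derivative exercise shows that it
is maximized for $n = 2\pi L^2$, with value $e^{4\pi L^2}$. Therefore, we can upper bound the series in the
expression above as follows:

$$\sum_{n=1}^{\infty} \left(\frac{2\pi e L^2}{n}\right)^{2n} \le \sum_{n=1}^{\lfloor 2\sqrt{2}\pi e L^2 \rfloor} \left(\frac{2\pi e L^2}{n}\right)^{2n} + \sum_{n=\lceil 2\sqrt{2}\pi e L^2 \rceil}^{\infty} \left(\frac{2\pi e L^2}{n}\right)^{2n}$$

$$\le 2\sqrt{2}\pi e L^2 e^{4\pi L^2} + \sum_{n=\lceil 2\sqrt{2}\pi e L^2 \rceil}^{\infty} \left(\frac{1}{2}\right)^n \le 3\pi e L^2 e^{4\pi L^2} ,$$

where the last transition is by the assumption that $L \ge 3$. Substituting into the bound on $B$, we
get the result stated in the lemma. ■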
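The two ingredients of this proof can be illustrated numerically. The sketch below rests on two assumptions drawn from the series used in the proof: that $\phi_{\mathrm{erf}}(a) = \frac{1}{2}(1 + \mathrm{erf}(\sqrt{\pi}La))$, and that membership in $P_B$ is measured by $\sum_j 2^j\beta_j^2$. It compares the Taylor series with `math.erf`, and the weighted coefficient sum with the bound of Lemma 7. The function names, the number of terms and the test point are illustrative choices.

```python
import math


def phi_erf(a: float, L: float) -> float:
    """Assumed closed form: (1 + erf(sqrt(pi) * L * a)) / 2."""
    return 0.5 * (1.0 + math.erf(math.sqrt(math.pi) * L * a))


def taylor_phi_erf(a: float, L: float, terms: int = 60) -> float:
    """Partial sum of 1/2 + (1/sqrt(pi)) * sum_n (-1)^n z^(2n+1) / (n! (2n+1))."""
    z = math.sqrt(math.pi) * L * a
    s = sum((-1) ** n * z ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
            for n in range(terms))
    return 0.5 + s / math.sqrt(math.pi)


def weighted_coeff_sum(L: float, terms: int = 400) -> float:
    """1/4 + (1/pi) * sum_n (2 pi L^2)^(2n+1) / ((n!)^2 (2n+1)^2), summed in log space."""
    c = 2.0 * math.pi * L * L
    total = 0.25
    for n in range(terms):
        log_term = ((2 * n + 1) * math.log(c) - 2.0 * math.lgamma(n + 1)
                    - 2.0 * math.log(2 * n + 1) - math.log(math.pi))
        total += math.exp(log_term)
    return total


def lemma7_bound(L: float) -> float:
    """1/4 + 2 L^2 (1 + 3 pi e L^2 exp(4 pi L^2))."""
    return 0.25 + 2.0 * L * L * (1.0 + 3.0 * math.pi * math.e * L * L
                                 * math.exp(4.0 * math.pi * L * L))


L = 3.0  # smallest value covered by the lemma
series_gap = abs(taylor_phi_erf(0.1, L) - phi_erf(0.1, L))
within_bound = weighted_coeff_sum(L) <= lemma7_bound(L)
```

The series is evaluated at a small argument because the alternating sum loses floating-point precision for large $|\sqrt{\pi}La|$; the weighted sum is accumulated via logarithms to keep the factorials from overflowing.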



